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Abstract: We propose two nonlinear Kalman smoothers that rely on Student's t distributions. 
The T-Robust smoother finds the maximum a posteriori likelihood (MAP) solution for Gaussian 

process noise and Student's t observation noise, and is extremely robust against oiitlicrs, 
outperforming the recently proposed £i-Laplacc smoother in extreme situations (e.g. 50% or 
more outliers). The second estimator, which wc call the T- Trend smoother, is able to follow 
sudden changes in the process model, and is derived as a MAP solver for a model with Student's 
t-process noise and Gaussian observation noise. We design specialized methods to solve both 
problems which exploit the special structure of the Student's t-distribution, and provide a 
convergence theory. Both smoothers can be implemented with only minor modifications to 
an existing L2 smoother implementation. Numerical results for linear and nonlinear models 
illustrating both robust and fast tracking applications are presented. 
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1. INTRODUCTION 

The Kalman filter is an efficient recursive algorithm that 
estimates the state of a dynamic system from measure- 
ments contaminated by Gaussian noise [Kalman, I960]. 
Along with many variants and extensions, it has found 
use in a wide array of applications including navigation, 
medical technologies and econometrics [Chui and Chen, 
2009, West and Harrison, 1999]. Many of these problems 
are nonlinear, and may require smoothing over past data 
in both online and offline applications to improve signifi- 
cantly the estimation performance [Gelb, 1974]. 
In this paper, we focus on two important areas in Kalman 
smoothing: robustness with respect to outliers in mea- 
surement data, and improved tracking of quickly changing 
system dynamics. Robust smoothers have been a topic of 
significant interest since the 1970's, e.g. see [Schick and 
Mitter, 1994]. In order to design outlier robust smoothers, 
recent reformulations described in [Aravkin et al., 2011b, a, 
Farahmand et al., 2011] use Li, Huber or Vapnik loss 
functions in place of L2 penalties. There have also been 
recent efforts to design smoothers that arc able to bet- 
ter track fast system dynamics, e.g. jumps in the state 
values. For example, in [Ohlsson et al., 2011] the Laplace 
distribution is used in place of the Gaussian distribution 
to model transition noises. This introduces an Li penalty 
on the state evolution in time and can be interpreted as 
a dynamic version of the well known LASSO procedure 
[Tibshirani, 1996]. 

All of these smoothers can be derived from a statistical 
point of view, using log-concave densities on process noise. 



measurement noise, and prior information on the state. 
Log-concave densities take the form 

p(-) cx exp{—p{-)), p convex . (l-l) 
Formulations using (1.1) are nearly ubiquitous, in part be- 
cause they correspond to convex optimization problems in 
the linear case. However, to effectively model a regime with 
large outliers or sudden jumps in the state, we want to look 
beyond (1.1) in order to allow heavy-tailed distributions. 
A particularly convenient heavy-tailed modeling distri- 
bution that falls outside of (1.1) is the Student's t- 
distribution. In the statistics literature, this distribution 
was successfully applied to a variety of robiist inference ap- 
plications [Lange et al., 1989], and is closely related to re- 
descending influence functions [Hampel et al., 1986]. In the 
context of Kalman filtering/smoothing, the idea of using 
Student's t-distributions to model both measurement error 
and innovations (process errors) was studied in [Fahrmeir 
et al., 1998]. 

We propose new nonlinear Kalman smoothers which we 
call T-Robust and T- Trend. For the T-Robust smoother, 
we model the measurement noise using the Student's t- 
distribution, extending the approach in [Aravkin et al., 
2011b] to the Student's t. As a result, errors or 'outliers' 
in the measurements have even less effect on the smoothed 
estimate, and performs better than [Aravkin et al., 2011b] 
for cases with high proportion of outliers (e.g. 50% bad 
data). For the T- Trend smoother, wc instead model pro- 
cess noise using the Student's t-distribution, which allows 
the smoother to track sudden changes in state. 
Our work differs for [Fahrmeir et al., 1998] in two im- 
portant respects. First, we include nonlinear measurement 



and process models in our analysis. Second, we propose 
a novel optimization algorithm to solve both T-Robust 
and T-Trend smoothing problems. This algorithm differs 
significantly from the one proposed in [Fahrmeir et al., 
1998]. In particular, rather than using the Fisher infor- 
mation matrix (i.e. full Hessian), or its expectation as 
a Hessian approximation (method of Fisher's scoring) as 
suggested by [Fahrmeir et al., 1998], we design a modified 
Gauss-Newton method which builds information about the 
curvature of the Student's t-log likelihood into the Hessian 
approximation, and provide a convergence theory. The 
fast smoothing procedures proposed here also allow the 
efficient estimation of hypcrparamctcrs such as degrees 
of freedom (e.g. cross-validation techniques using the EM 
algorithm are discussed in [Fahrmeir et al., 1998]). 
The paper is organized as follows. In section 2 we introduce 
the multivariate Student's t-distribution and define the 
models underlying the T-Robust and T-Trcnd smoothers. 
In Section 3 the maximum likelihood objective for T- 
Robust, the quadratic approximation for this objective, 
and the convex quadratic program to solve this approxi- 
mate subproblem are reported. In Section 4 the T-Trend 
smoother is designed by modeling transitions using Stu- 
dent's t. We describe our algorithm, provide a convergence 
theory, and explain the differences with [Fahrmeir et al., 
1998] in 5. The T-Robust and T-Trend smoothers are 
tested using simulated data for linear and nonlinear models 
in Section 6. We end the paper with some concluding 
remarks. 

2. THE T-ROBUST AND T-TREND SMOOTHING 
PROBLEMS 

For a vector u € M" and any positive definite matrix 
M e M"^", let ||u||m := Vu^Mu. We use the following 
generalization of the Student's t-distribution: 



p(t^felM) 



r(^) 



r(f)det[7rsi?]i/2 



1 



\\vk 



(2.1) 



where fj, is the mean, s is the degrees of freedom, m is the 
dimension of the vector Vk , and R is a positive definite ma- 
trix. A comparison of this distribution with the Gaussian 
and Laplacian distribution appears in Figure 2. Note that 
the Student's t-distribution has much heavier tails than 
the others, and that its influence function is re-descending, 
see [Maronna et al., 2006] for a discussion of influence 
functions. This means that as we pull a measurement 
further and further away, its 'influence' decreases to 0, so it 
is eventually ignored by the model. Note also that the £i- 
Laplace is peaked at 0, while the Student's t-distribution 
is not, and so a Student's t-fit will not in general drive 
residuals to be exactly 0. 

We use the following general model for the underlying 
dynamics: for A; = 1, . . . , iV 

Xk = gk{xk-i) + Wk /2 2) 

Zk = hk{zk)+Vk ^ ' ' 

with initial condition gi{xQ) = rjQ + wi, with go ^ known 
constant, and where gk ■ K" — >■ K" are known smooth 
process functions, and /ife : R" ^ K^C^) are known smooth 
measurement functions. 

For the T-Robust smoother, we assume that the vector 
Wk € M" is zero-mean Gaussian noise of known covariance 



Qk E M"^", and the vector Vk E M^C^) is zero-mean 

Student's t measurement noise (2.1) of known covariance 

Rk e W"ik)xm{k) degrees of freedom s. 

For the T-Trend smoother, the roles are interchanged, i.e. 

Wk is Student's t noise while Vk is Gaussian. In both cases, 

we assume that the vectors {wk} U {vk} are all mutually 

independent. 

In the next sections, we design methods to find the MAP 
estimates of {xk} for both formulations. 

3. T-ROBUST SMOOTHER 

Given a sequence of column vectors {uk} and matrices 
{Tfc} we use the notation 
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We also make the following dcflnitions: 
R = diag({ii'fc}) 
Q = diag({gfc}) 
X = vec({xfe}) 

Maximizing the likelihood for the model (2.2) is equivalent 
to minimizing the associated negative log likelihood 

-\ogp{{vk}:{wk}) = -logp({:/fc}) - logp({wfc}) 

Dropping the terms that do not depend on {xk}, the 
objective corresponding to T-Robust is 
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, (3.1) 



where .s^'s are degrees of freedom parameters associated 
with measurement noise, and are the dimensions of 
the fcth observation. 

A first-order accurate affine approximation to our model 
with respect to direction d = vec{dfc} near a fixed state 
sequence x is given by 

w{x\d) = Yec{{xk - gk{xk-i) - 9k^{xk-i)dk}), 
v{x; d) = Yec{{zk - hk{xk) - h^k\^k)dk})- 

Set Qn+1 = Ifi a-nd gN+i{xN) = (where /„ is the n x n 
identity matrix) so that the formulas are also valid for 
k = N + l. 

We minimize the nonlinear nonconvex objective in (3.1) 
by iteratively solving quadratic programming (QP) sub- 
problems of the form: 

min ^d'^Cd + a^d w.r.t d e M"^, (3.2) 

where a is the gradient of objective (4) with respect to x 
and C has the form 



C = 







A2 C2 

'•. '•• 







(3.3) 



An Cn 

with Afe e K"^" and e R"^" defined as follows: 




Fig. 1. Gaussian, Laplace, and Student's t Densities, Corresponding Negative Log Likelihoods, and Influence Functions 
(for scalar Vk)- 



Hu = 



{Sk + ||wfc|| + mfe) 

The solutions to the subproblcm (3.2) have the form d = 
—C~^a, and can be found in an efficient and numerically 
stable manner in 0{rt'N) steps, since C is tridiagonal and 
positive definite (see [Bell ct al., 2008]). 

4. T-TREND SMOOTHER 

The objective corresponding to T-Trend is 

|2 



1 ^ 
/c=l 
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(4.1) 



where arc degrees of freedom parameters associated 
with process noise, and n is the dimension of each state Xk- 
A first-order accurate affinc approximation to our model 
with respect to direction d = vec{rffe} near a fixed state 
sequence x is as follows: 

w{x\d) = vec{{xk - gk{x-i) - g^k\xk-i)dk}), 
v{x;d) = vec({2;fc - hkixk) - h^k i^k)dk})- 

As before, we set Qn+i = In and gN+i{xN) = (where 
/„ is the nx n identity matrix) so that the formula is also 
valid for fc = A'' + 1. 

We minimize the nonlinear objective in (4.1) by itcrativcly 
solving quadratic programming (QP) subproblems of the 
form 

min ^d'^Cd + a^d w.r.t d G K"^, (4.2) 

where a is the gradient of objective (4) with respect to x 
and C again has form (3.3), but now with Ak G M"^" and 
Cfc e K"^" defined as follows: 



Ak = - 



{rk+n)Q-'gi'^ 



Tk + ||w.'fe+l||^^i ' 

a))To-if,(i) 

^dk+iV' Q k+i9k+i 
irk + \\wk\\%-^)/{rk+ny 



Ck = Qk' + ih'k'VRk'K 
,,.(1) .(1) 

Hk = 



Hk 



(4.3) 



The solutions to the subproblem (4.2) again have the 
form d = —C~^a, where C is tridiagonal and positive 



definite, so that they still can be found in an efficient and 
numerically stable manner in 0{n^N) steps, see [Bell et al., 
2008]. 

5. ALGORITHM AND GLOBAL CONVERGENCE 

When models gk and hk are all linear, we can compare 
the algorithmic scheme proposed in the previous sections 
with the method in [Fahrmeir et al., 1998]. The method 
in [Fahrmeir et al., 1998] proposes using the Fisher infor- 
mation matrix in place of the matrix C above. When the 
densities for Wk and Vk are Gaussian, this is equivalent to 
the Gauss-Newton method. However, in the Student's t- 
case, the Fisher information matrix may be indefinite. 
When this occms, [Fahrmeir ct al., 1998] propose to use 
the expectation of the Fisher information matrix, i.e. the 
Fisher scoring method. In this approach the expected value 
of C replaces the terms [[wfejll or ||t^fc||2 in the denomina- 
tors of Hk and Ak with their expectations, which are only 
functions of Sk and rk , the degrees of freedom. The crucial 
difference here is that in practice, we want to pass the 
information about which \\wk\\ and \\vk\\ are large to the 
algorithm, so that it can curtail their contribution to the 
model updates. So while the Fisher information matrix is 
too unstable (can become; indefinite), the expected Fisher 
information is insensitive to the magnitude variations the 
algorithm should incorporate as it proceeds. 
For these reasons, we propose a Gauss-Newton method 
which uses the relative size information of the residuals 
to find the directions of descent, and provide a proof of 
convergence for the application of this method to solve 
(3.1). This objective takes the form K = p o F, with the 
convex function p and the smooth function F given by 



F{x) = 



.1,1 

/(■^) 

r(x) 
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fe=i 



1 + 



\\Pk{x) 



h 



(5.1) 
(5.2) 

, (5.3) 



where r{x) and p{x) are smooth functions of x, A and 
{Bk} are known positive definite matrices, bk,lk are fixed 
constants, and c € M. This structure covers both T- 
Robust and T-Trend, where for T- Robust pk{x) = Vk{x), 
r{x) = w{x), A ~ diag[{Qk}] and Bk = Rk, and for T- 
Trend Pk{x) = Wk{x), r{x) = v{x), A = diag[{i?fe}] and 
Bk = Q)k. Note that the range of / is R+. 



The objective K{x) = poF(x) is convex composite, since p 
is convex and F{x) is smooth. Our approach exploits this 
structure by iteratively linearizing F about the iterates x'^ 
and solving 

min p{F(x^)+F'^^\x'')d). f5 4) 

Rather than requiring interior point methods to solve (5.4) 
as in [Aravkin et al., 2011b], a single block-tridiagonal 
solve of the system (3.2) yields descent direction d for the 
objective K{x). While the details of solving the direction 
finding subproblcm are always problem specific, a general 
convergence theory for convex-composite methods can 
be derived to establishes the overall convergence to a 
stationary point of K{x). The theory required generalizes 
that of [Aravkin et al., 2011b], and includes both T- 
Robust and T- Trend formulations. We present the theory 
and remark on its particular application to the problems 
of interest. Please see [Aravkin, 2010, Theorem 4.5.2, 
Corollary 4.5.3] for the proofs. 

Recall the first-order necessary condition for optimality in 
the convex composite problem minimize K(x) is 

e dK{x) = dp{F{x))F^^\x) 
where dK{x) is the generalized subdifferential of at a; 
Rockafellar and Wets [1998] and dp{F{x)) is the convex 
subdifferential of p at F{x) Rochafellar [1970]. Elementary 
convex analysis gives us the equivalence 

e dK{x) ^ K{x) = inf p ( F{x) + F'^^^ {x)d ) . 

For both T-Robust and T-Trend, it is desirable to modify 
the objective in (5.4) by including curvature information. 
We therefore define the difference function 

A(a;, H;d) = p(^ F{x) + F^^'> {x)d ) + ^d^Hd - K{x) , 

(5.5) 

where H = H{x) is positive scmidefinitc and varies 
continuously with x, and the minimum of A(x, ff; d) with 
respect to direction d 

A* {x,H) = inf A{x,H- d) . (5.6) 

d 

Since Hd is difFerentiable at the origin for any H, we 
have A* (a;, H) = ii and only if € dK{x) Burke [1985]. 
Given r] G (0, 1), we define a set of search directions at x 
by 

D{x,H,r]) = {d\ A{x, H; d) < ryA* {x, H) } . (5.7) 
Note that if there is a d e D{x,H,i]) such that 
A{x,H; d) > -r/e, then A*{x,H) > -e. These ideas 
motivate the following algorithm Burke [1985]. 
Algorithm 5.1. Gauss-Newton Algorithm. 

(1) Given G M^" an initial estimate of state sequence, 
Hq € §" the initial curvature information, e > an 
overall termination criteria, rj g (0, 1) a termination 
criteria for subproblem, P e (0, 1) a line search 
rejection criteria, 7 € (0, 1) a line search step size 
factor. Set the iteration counter v = 0. 

(2) (Gauss- Newton Step) Find d" in the set Dix",!!,,, rf) 
in 5.7. Set A^, = A{x'^ , H^;d^) in 5.5 and Terminate 
if A^ > -£. 

(3) (Line Search) Set 

ti, = max 7* 

s.t. i e {0,1,2, •••} and 

s.t. p [Fix-" + ^'d")) < p {Fix")) + /Jf A,. 



Table 1. Median MSE over 1000 runs and 
intervals containing 95% of MSE results 



Outlier 


P 


KS MSE 


RKS MSE 


TKS MSE 


Nom. 




.04(.02, .1) 


.04(.01, .1) 


.04(.01, .09) 


N(0,10) 


.1 


.17(.05, .55) 


.05(.02, .13) 


.04(.02, .11) 


N(0, 100) 


.1 


1.3(.30, 5.0) 


.05(.02, .14) 


.04(.02, .11) 


U(-10, 10) 


.1 


.47(.12, 1.5) 


.05(.02, .13) 


.04(.02, .10) 


N(0,10) 


.2 


.32(.ll, .95) 


.06(.02, .19) 


.05(.02, .16) 


N(0, 100) 


.2 


2.9(.94, 8.5) 


.07(.02, .22) 


.05(.02, .14) 


U(-10, 10) 


.2 


1.1(.36, 3.0) 


.07(.03, .26) 


.05(.02, .13) 


N(0, 10) 


.5 


.74(.29, 1.9) 


.13(.05, .49) 


.10(.04, .45) 


N(0, 100) 


.5 


7.7(2.9, 18) 


.21(.06, 1.6) 


.09(.03, .44) 


U(-10, 10) 


.5 


2.6(1.0, 5.8) 


.20(.06, 1.4) 


.10(.03, .44) 



(4) (Iterate) Set x''+'^ = x^+t^d", select H^+i e and 
goto Step 2. 

Theorem 5.2. Let K{x) = p o F{x), with p be convex 
and coercive on its domain, F continuously differentiable, 
and further assume that x^ € M^" is such that F^^^ is 
uniformly continuous on the set co ({x | K{x) < K{x°) }). 
Fix a;° e K^", define 

A={u\ p{u) < K{x°) } . 

Suppose that F~^{h) = {x \ F{x) G A} is bounded, and 
either of the following assumptions hold: 

< A„,i„ < eig{H'') (5.8) 

Null (f^^\x)^ = {0} V a; e F"^(A) . (5.9) 

If {x'^} is a sequence generated by Algorithm 5.1 with 
initial point and e = 0, then {.x''^} and {(i'^} arc bounded 
and either the algorithm terminates finitely at a point x'^ 
with e dK{x'^), or Aj, — >■ as z/ ^ 00, and every 
cluster point x of the sequence {x^} satisfies E dK{x). If 
{(x'^, i/jx)} is a sequence generated by the Gauss-Newton 
algorithm above with initial point (x°,i?o) and £ = 0. 
If the sequence {Hi,} remains bounded, then {x'^} and 
{d'^} are bounded and either the algorithm terminates 
finitely at a point x'^ with G dKix"), or A^y as 
— >■ 00. Moreover, every cluster point x of the sequence 
{x"} satisfies e dK{x). 

Remark 5.3. Both T- Robust and T- Trend can be solved 
by Algorithm 5.1, with the objective function K{x) as 
above. Note that p above is coercive on the range of F. 
For T-Robust, (5.9) always holds, since of F'^^' contains 
a block-bidiagonal matrix with identities on the diagonal. 
For T-Trend, (5.8) holds if all g^\xk) are nonsingular for 
all a; e F-i(A); see (4.3). 

6. NUMERICAL EXPERIMENTS 
6.1 T-Robust Smoother 

Linear Example In this section we compare the new T- 
robust smoother with the i2-Kalman smoother [Bell et al., 
2008] and with the ^i-Laplace robust smoother [Aravkin 
et al, 2011b], both implemented in [Bell et al., 2007-2011]. 
The ground truth for this simulated example is 

T 

x{t) = [— cos(f) — sin(f)] . 

The time between measurements is a constant At. We 
model the two components of the state as integral and 
two-fold integral of the same white noise, so that 



gkixk-1 



1 
At 1 



At At'^/2 
^^"^ ' ^'^ " [At^/2 At^/3 ■ 

The measurement model for the conditional mean of 
measurement Zk given state Xk is defined by 

hkixk) = [0 f] Xk = X2,k , Rk^ <J^ , 

where X2,k denotes the second component of Xk, <J^ = 0.25 
for all experiments, and the degrees of freedom parameter 
k was set to 4 for the Student's t methods. 
The measurements {zfc} were generated as a sample from 

Zk = 2:2(tfe) /S= 1,...,100, tk=QMTTXk 

where the measurement noise Vk was generated according 
to the following schemes. 

(1) (Nominal): Vk ^ N(0,0.25) 

(2) (Contaminating Normal) Ufc - (1 - p)N(0, 0.25) + 
pN(0, 0), for p e {0.1, 0.2, 0.5} and (/> € {10, 100}. 

(3) (Contaminating Uniform) Same as above, but with 
U[— 10, 10] replacing normal contamination, and p = 
0.1,0.2,0.5. 

The results for our simulated fitting are presented in 
Table 1. Each experiment was performed 1000 times, and 
we provide the median Mean Squared Error (MSE) value 
and a quantile interval containing 95% of the results. The 
MSE is defined by 



1 



N 
k=l 



'Xi{tk) - Xi^kf + [X2itk) - X2,k] 



(6.1) 



where {xk} is the corresponding estimating sequence. 

Note that both of the smoothers perform as well as the 
(optimal) L2-smoother at nominal conditions, and that 
both continue to perform at that same level for a variety 
of outlier generating scenarios. The T-smoother always 
performs at least as well as the €i-smoother, and it gains 
an advantage when either the probability of contamination 
is high, or the contamination is uniform. This is likely due 
to the re-descending influence function of the Student's t- 
distribution — the smoother effectively throws out bad 
points rather than simply decreasing their impact to a 
certain threshold, as is the case for the ^i-smoother. 

Nonlinear Example In this section, we present results for 
the Van Der Pol oscillator (VDP), described in detail in 
[Aravkin et al., 2011b]. The VDP oscillator is a coupled 
nonlinear ODE 

Xi{t) = X2{t) and X2{t) - fi[l-Xi{tf]X2{t)-Xi{t) 
The process model here is the Euler approximation for 
X(tk) given X{t^i): 

I . _ ( Xi^k-l + X2,k^lAt 

S^^""^-^' - \ X2.k-x + {/i[l - x\k\x^,k - XukjAt 
For this simulation, the ground truth is obtained from a 
stochastic Euler approximation of the VDP. To be specific, 
with ^ = 2, = 164 and A< = 16/N, the ground 
truth state vector Xk at time tk = kAt is given by 
xo = (0, -0.5)"^ and for fc = 1, . . . , iV, Xfe = gk{xk-i)+Wk, 
where {wk} is a realization of independent Gaussian noise 
with variance 0.01. 

The £i-Laplace smoother was shown to have superior per- 
formance to the Gaussian nonlinear smoother in [Aravkin 
et al., 2011b], both implemented in [Bell et al., 2007-2011]. 
We compared the performance of the nonlinear T-robust 





Fig. 2. Smoother fits for X-component (left) and Y- 
componcnt (right) of the Van Der Pol oscillator, with 
70% outliers Af(0, 100). Black solid line is truth, 
magenta dash-dot is the ix smoother result, and 
blue dashed line is T-robust. Measurements on X- 
component are shown as dots, with outliers outside 
the range [—5, 5] plotted on top and bottom axes. 

and nonlinear ^i-Laplace smoothers, and found that T- 
robust gains an advantage in the extreme cases of 70% 
outliers (see Figure 2), and otherwise is hard to distinguish 
from the £i-Laplace for 40% or fewer outliers. 

6.2 T- Trend Smoother 

We present a proof of concept result for the T- Trend 
smoother, in particular considering two Monte Carlo stud- 
ies of 200 runs. In the first study, the state vector, as 
well as the process and measurement models, are exactly 
the same as in the linear example used for the T-Robust 
smoother in the previous subsection. At any run, X2 has 
to be reconstructed from 20 measurements corrupted by 
a white Gaussian noise of variance 0.05 and collected on 
[0, 2tt] using a uniform sampling grid. The top panel of 
Figure 3 reports the boxplot of the 200 root-MSE errors, 

with MSE defined by \J J2k=il^^(^k) ~ X2.k^ ^ obtained 
using the L2-, ^\-^ and T- Trend Kalman smoothers, while 
the top panel of Figure 4 displays the estimate obtained 
in a single run. It is apparent that the performance of the 
three estimators is very similar. 

The second experiment is identical to the first one except 
that we introduce a 'jump' at the middle of the sinusoidal 
wave. The bottom panel of Figure 3 reveals the superior 
performance of the T- Trend smoother under these per- 
turbed conditions. Further, the bottom panel of Figure 
4 shows that the estimate achieved by the L2-smoother 
(dashed-line) does not follow the jump well (the true state 
is the solid line). The ^i-smoother (dashdot) does a better 
job than the L2-smoother, but the T-trcnd smoother out- 
performs the ^i-smoother, following the jump very closely 
while still providing a good solution along the rest of the 
path. 

7. CONCLUSION 

We have described two new nonlinear smoothers, called T- 
Robust and T- Trend, which efhciently obtain the MAP es- 
timates of the states in a state-space model with Student's 



Nominal conditions 



Perturbed conditions 




Fig. 3. T- Trend Smoother: Monte Carlo simulation. Box- 
plot of errors obtained using Gaussian, Laplace and 
Student's T Kalman smoother under nominal (top) 
and perturbed (bottom) conditions. 



Nominal conditions 




Fig. 4. T- Trend Smoother: results from a Monte Carlo 
run under nominal (top) and perturbed (bottom) 
conditions using L2 (dashed), Li (dashdot) and T — 
Trend (thin line) smoother. The thick line is the true 
state. 

t measurement and Student's t process noise, respectively. 
The T-Robust smoother compares favorably to the ii- 
Laplace smoother — the smoothers are comparable for 
most error scenarios presented, and the T-smoother has 
an advantage for high levels of contamination because 
of the re-descending influence function of the Student's 
t-distribution. In addition, although it involves a non- 
convex objective, it is simple to implement using minor 
modifications to the L2 nonlinear smoother. 
The T-Trend smoother was designed for tracking signals 
with potential sudden changes, and has many potential 
applications in navigation and financial trend tracking. It 
was demonstrated to follow a fast jump in the state better 
than a smoother with a convex penalty on the innovation. 
Just as the T-robust smoother, it can be implemented with 
minor modifications to an L2 nonlinear smoother. 
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8. APPENDIX: FULL CONVERGENCE THEORY 

If {x'^} is a sequence generated by the Gauss-Newton 
Algorithm 5.1 with initial point x^ and e = 0, then one of 
the following must occur: 



(i) The algorithm terminates finitely at a point x'^ with 

e dKix"). 

(ii) lim^g/ Aj, = for every subsequence / for which the 
set {d" I i/ G / } is bounded. 

(iii) The sequence \\d'^\\ diverges to +00. 

Moreover, if x is any cluster point of a subsequence / C Z_|_ 
such that the subsequence {d'^\u e /} is bounded, then 

G dK{x). 

Proof: Assertions (i), (ii), and (iii) are a restatement of 
[Burke, 1985, Theorem 2.4] in our context, where the sets 
£)y in [Burke, 1985, Theorem 2.4] are given by D^, = 
D{x" ,r}). The requirement that p be Lipschitz continuous 
on the set {(u) | p(u,) < K{x^^) } is an immediate conse- 
quence of the fact that p is coerc;ive its domain, so this set is 
compact. This completes the proof of (i), (ii), and (iii). By 
compactness, the matrices H are uniformly continuous in x 
on this set. Suppose that a; is a cluster point of a sequence 

1 C Z+ for which {cF} is bounded. Since i is a clus- 
ter point of {a;''}, we can take a convergent subsequence 
along which y } are still bounded, and by continuity H" 
converge to H = H{x). By Bolzano- Weierstrass, we can 
then find a subsequence J C I and vector d G R^" such 
that {x'',d'') -^j (.X, d). Fix any other point d e K^". By 
construction, we have 



A, = p(^Fix-')+F^'\x'')d''^ + 

<v(^p (F{x'') + F^^\x'')d 
Taking the limit over J gives 



p{Fixn) 



1 



\\d\\i.-p{Fixn) 







< 



p(^Fix)+F<^'\x)d)+-\\drH-piFix)) 

l\\l-p{F{x)) 



r](p (F(x) + FW(a;)d) 



But d was an arbitrary point in R-'^", so in particular we 
must have A* {x) =0. ■ 

A stronger convergence result is possible under stronger 
assumptions on F and F^^\ or on the sequence H". Fix 

x° e K^", and define 

A^{u\p{u)<Kix°)} , (8.1) 

where p is as above. Note that A is compact, since p is 
coercive on its domain. 

Corollary 8.1. Suppose that F"^(A) = {x \ F{x) e A} is 
bounded, and either of the following assumptions hold: 

< A„,i„ < eigiH-') (8.2) 
Null (F(i)(a;)) = {0} V a; e F"^(A) . (8.3) 

If {x"} is a sequence generated by Algorithm 5.1 with 

initial point x° and e = 0, then {x'^} and {d'^} are bounded 
and either the algorithm terminates finitely at a point x'^ 
with e dK{x'^), or A,^ — >• as — >■ 00, and every cluster 
point X of the sequence {x"} satisfies G dK{x). 

Proof: First note that F^^(A) is closed since F is con- 
tinuous and A is compact, therefore F~^(A) is compact. 
Hence co (F~^(A)) is also compact. Therefore, F^^^ is uni- 
formly continuous on co (F~^(A)) which implies that the 
hypotheses of Theorem 5.2 are satisfied, and so one of (i)- 
(iii) must hold. If (i) holds we are done, so we will assume 



111.^ • 



m.<p{F{x)) (8.4) 



that the sequence {x'^} is infinite. Since {x'^} C F~^{\), 
this sequence is bounded. We now show that the sequence 
{d'^} of search directions is also bounded. 
Suppose that (8.2) holds. For any direction d" , note that 
d" solves 

mm p (^F{x)+F'-^\x)d}j 
Therefore we have 

p{F{x)+F(^\x)d^) . 2, 

since we can achieve p(F{x)) with d = 0. Since p > 0, 

we must have {^Wd^Wj^,,} < p{F{x)), hence are 

bounded, and d'^ are bounded by (8.2). 

Suppose instead that (8.3) holds. We claim that there 

exists K > such that 

K\\d\\ < ||F(i)(a;)d|| V d G M"^" and x G ^-^(A) . 

Indeed, if this were not the case, then there would exist 
sequences {y^} C ^-^(A) and {rf*} C M^" such that 
and 

\\d'\\/i. > \\F^^\y')d'\\ Vz = l,2,... . 

The set F^^(A) is compact, hence there exists a subse- 
quence _J C Z_|_, vector x G F~^(A), and vector d G M''^" 
with ||d|| = 1, such that x" -^j x and rf'/IM'H -^J d. It 
follows from the inequality above that 



1 



> 



F^^\x') 



Take the limit with respect to the subsequence J we obtain 
> l|F(i)(5)dl|. Thus d is in the kernel of F^'^^x) and 
d^O. This contradicts (8.3) and thereby proves the claim. 
For any direction d" , (8.4) hods and F{x) + F^'^\x)d'' G A 
since > 0. Since A is compact, and {F{x''), F{x'') + 

F'-^\x'')d'^} C A by construction, there is an a > such 
that ||u|| < a for all « G A and for = 1, 2, . . . , 

Kl|dni<|| F^'H^ld'^ II 

< II F(a;'') + F^^\x'')d'' || + ||F(a;'')|| < 2a . 

Hence the sequence {d"} of search directions is bounded. 
In both cases. Theorem 5.2 tells us that A^ — ^ as 1/ ^ 00. 
The final statement of the corollary follows immediately 
from the final statement of Theorem 5.2. ■ 



